[Bugfix] Guard for negative counter metrics to prevent crash #10430
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
4102ffd to 338740a
Not sure if it's worth adding a test to |
Let's fix #6325 in another PR.
…oject#10430) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com>
…oject#10430) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
…oject#10430) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
…oject#10430) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: rickyx <rickyx@anyscale.com>
…oject#10430) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
I'm not sure how it happens, but we have observed crashes when running vLLM in online mode, caused by a negative value being passed as the increment to a Prometheus counter.
This PR adds a check on the increment value before calling the Prometheus client, which avoids the crash. The root cause of the negative value still needs more investigation.
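As a rough illustration of the kind of guard described above (not the exact code in this PR), here is a minimal sketch. `StubCounter` and `guarded_inc` are hypothetical names introduced for this example. `StubCounter` stands in for a Prometheus counter so the snippet runs without extra dependencies, and it mirrors the real client's behavior of raising `ValueError` on a negative increment.

```python
import logging

logger = logging.getLogger(__name__)


class StubCounter:
    """Hypothetical stand-in for a Prometheus counter.

    Like the real client, it refuses negative increments by raising ValueError.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts.")
        self.value += amount


def guarded_inc(counter: StubCounter, amount: float) -> bool:
    """Increment the counter unless the amount is negative.

    Returns True if the increment was applied, False if it was skipped.
    """
    if amount < 0:
        # Log and skip rather than letting the ValueError crash the caller.
        logger.warning("Skipping negative increment of %g to %s",
                       amount, counter.name)
        return False
    counter.inc(amount)
    return True


# Example: the negative value is dropped, the other increments are applied.
tokens = StubCounter("generation_tokens_total")
guarded_inc(tokens, 5)
guarded_inc(tokens, -3)  # skipped with a warning instead of raising
guarded_inc(tokens, 2)
```

Skipping the update keeps the server running, but it also hides whatever produced the negative value, so the warning log is the main signal left for tracking down the root cause.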
FIX #6642
#6325 is related and shows the same error.